Chapter 6. Content Description and Modification
Apache has the ability to tune the information it returns to the abilities of the client — and even to improve the client's efforts. Currently, this affects:
Apache v2 also offers a new mechanism, filters, which is described at the end of this chapter in Section 6.6.

6.1 MIME Types

MIME stands for Multipurpose Internet Mail Extensions, a standard developed by the Internet Engineering Task Force for email but then repurposed for the Web. Apache uses mod_mime.c, compiled in by default, to determine the type of a file from its extension. MIME types are more sophisticated than file extensions, providing a category (like "text," "image," or "application"), as well as a more specific identifier within that category. In addition to specifying the type of the file, MIME permits the specification of additional information, like the encoding used to represent characters.

The "type" of a file that is sent is indicated by a header near the beginning of the data. For instance:

    content-type: text/html

indicates that what follows is to be treated as HTML, though it may also be treated as text. If the type were "image/jpeg", the browser would need to use a completely different bit of code to render the data.

This header is inserted automatically by Apache[1] based on the MIME type and is absorbed by the browser, so you do not see it if you right-click in a browser window and select "View Source" (MSIE) or similar. Notwithstanding, it is an essential element of a web page.

The list of MIME types that Apache already knows about is distributed in the file .../conf/mime.types or can be found at http://www.isi.edu/in-notes/iana/assignments/media-types/media-types. You can edit it to include extra types, or you can use the directives discussed in this chapter. The default location for the file is .../<site>/conf, but it may be more convenient to keep it elsewhere, in which case you would use the directive TypesConfig.

Changing the encoding of a file with one of these directives does not change the value of the Last-Modified header, so cached copies with the old label may linger after you make such changes.
(Servers often send a Last-Modified header containing the date and time the content was last changed, so that the browser can use cached material at the other end if it is still fresh.)

Files can have more than one extension, and their order normally doesn't matter. If the extension .itl maps onto Italian and .html maps onto HTML, then the files text.itl.html and text.html.itl will be treated alike. However, any unrecognized extension, say .xyz, wipes out all extensions to its left. Hence text.itl.xyz.html will be treated as HTML but not as Italian.
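As a rough illustration of the extension rules just described, the following sketch sets up the two mappings used in the example. It assumes the language code it for Italian, which the text does not spell out; .html is normally mapped already by the MIME types file, so the AddType line is only there to make the sketch self-contained.

```apache
# Assumed mappings for the example in the text
AddLanguage it .itl
AddType text/html .html

# text.itl.html and text.html.itl : served as text/html, language it
# text.itl.xyz.html               : served as text/html only, because the
#                                   unrecognized .xyz discards .itl to its left
```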
The TypesConfig directive sets the location of the MIME types configuration file. filename is relative to the ServerRoot. This file sets the default list of mappings from filename extensions to content types; changing this file is not recommended unless you know what you are doing. Use the AddType directive instead. The file contains lines in the format of the arguments to an AddType command: MIME-type extension extension ... The extensions are lowercased. Blank lines and lines beginning with a hash character (#) are ignored.
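A minimal sketch of how this might look in practice. The path conf/mime.types is the conventional location relative to ServerRoot; the two sample entries are typical lines from a stock mime.types file and are shown as comments only to illustrate the format.

```apache
# In the Config file; the path is resolved relative to ServerRoot
TypesConfig conf/mime.types

# Lines in the mime.types file itself take this form:
#   text/html     html htm
#   image/jpeg    jpeg jpg jpe
```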
The AddType directive maps the given filename extensions onto the specified content type. MIME-type is the MIME type to use for filenames containing extensions. This mapping is added to any already in force, overriding any mappings that already exist for the same extension. This directive can be used to add mappings not listed in the MIME types file (see the TypesConfig directive). For example: AddType image/gif .gif It is recommended that new MIME types be added using the AddType directive rather than changing the TypesConfig file. Note that, unlike the NCSA httpd, this directive cannot be used to set the type of particular files. The extension argument is case insensitive and can be specified with or without a leading dot.
The server must inform the client of the content type of the document, so in the event of an unknown type, it uses whatever is specified by the DefaultType directive. For example: DefaultType image/gif would be appropriate for a directory that contained many GIF images with file-names missing the .gif extension. Note that this is only used for files that would otherwise not have a type.
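A sketch of the GIF directory case described above, scoped so that the fallback applies only where it is wanted. The directory path is hypothetical and stands in for wherever the extensionless images actually live.

```apache
# Hypothetical directory holding GIF images that lack a .gif extension
<Directory /usr/www/APACHE3/site.example/htdocs/images>
    # Applies only to files whose type cannot be determined otherwise
    DefaultType image/gif
</Directory>
```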
Given a directory full of files of a particular type, ForceType will cause them to be sent as media-type. For instance, you might have a collection of .gif files in the directory .../gifdir, but you have given them the extension .gf2 for reasons of your own. You could include something like this in your Config file:

    <Directory <path>/gifdir>
    ForceType image/gif
    </Directory>

You should be cautious in using this directive, as it may have unexpected results. This directive always overrides any MIME type that the file might usually have because of its extension, so even .html files in this directory, for example, would be served as image/gif.
The RemoveType directive removes any MIME type associations for files with the given extensions. This allows .htaccess files in subdirectories to undo any associations inherited from parent directories or the server config files. An example of its use is to have the following in /foo/.htaccess: RemoveType .cgi This will remove any special handling of .cgi files in the /foo/ directory and any beneath it, causing the files to be treated as the default type.
The extension argument is case insensitive and can be specified with or without a leading dot.
The AddEncoding directive maps the given filename extensions to the specified encoding type. mime-enc is the MIME encoding to use for documents containing the extension. This mapping is added to any already in force, overriding any mappings that already exist for the same extension. For example: AddEncoding x-gzip .gz AddEncoding x-compress .Z This will cause filenames containing the .gz extension to be marked as encoded using the x-gzip encoding and filenames containing the .Z extension to be marked as encoded with x-compress. Older clients expect x-gzip and x-compress; however, the standard dictates that they're equivalent to gzip and compress, respectively. Apache does content-encoding comparisons by ignoring any leading x-. When responding with an encoding, Apache will use whatever form (i.e., x-foo or foo) the client requested. If the client didn't specifically request a particular form, Apache will use the form given by the AddEncoding directive. To make this long story short, you should always use x-gzip and x-compress for these two specific encodings. More recent encodings, such as deflate, should be specified without the x-. The extension argument is case insensitive and can be specified with or without a leading dot.
The RemoveEncoding directive removes any encoding associations for files with the given extensions. This allows .htaccess files in subdirectories to undo any associations inherited from parent directories or the server config files. An example of its use might be:

    /foo/.htaccess:
        AddEncoding x-gzip .gz
        AddType text/plain .asc
        <Files *.gz.asc>
            RemoveEncoding .gz
        </Files>

This will cause foo.gz to be marked as being encoded with the gzip method, but foo.gz.asc as an unencoded plain-text file. This might, for example, be a hash of the binary file to prevent illicit alteration.

Note that RemoveEncoding directives are processed after any AddEncoding directives, so it is possible they may undo the effects of the latter if both occur within the same directory configuration.

The extension argument is case insensitive and can be specified with or without a leading dot.
This directive specifies the name of the character set that will be added to any response that does not have any parameter on the content type in the HTTP headers. This will override any character set specified in the body of the document via a META tag. A setting of AddDefaultCharset Off disables this functionality. AddDefaultCharset On enables Apache's internal default charset of iso-8859-1 as required by the directive. You can also specify an alternate charset to be used; e.g. AddDefaultCharset utf-8. The use of AddDefaultCharset is an important part of the prevention of Cross-Site Scripting (XSS) attacks. For more on XSS, refer to http://www.idefense.com/XSS.html.
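The effect is easiest to see in the response header. The sketch below uses the utf-8 setting mentioned above; the header shown in the comment is what one would expect for an HTML document that carries no charset parameter of its own.

```apache
# Label responses that have no charset parameter as UTF-8
AddDefaultCharset utf-8

# An HTML response would then be sent with a header such as:
#   Content-Type: text/html; charset=utf-8
```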
The AddCharset directive maps the given filename extensions to the specified content charset. charset is the MIME charset parameter of filenames containing the extension. This mapping is added to any already in force, overriding any mappings that already exist for the same extension. For example:

    AddLanguage ja .ja
    AddCharset EUC-JP .euc
    AddCharset ISO-2022-JP .jis
    AddCharset SHIFT_JIS .sjis

Then the document xxxx.ja.jis will be treated as being a Japanese document whose charset is ISO-2022-JP (as will the document xxxx.jis.ja). The AddCharset directive is useful both to inform the client about the character encoding of the document so that the document can be interpreted and displayed appropriately, and for content negotiation, where the server returns one from several documents based on the client's charset preference.

The extension argument is case insensitive and can be specified with or without a leading dot.
The RemoveCharset directive removes any character-set associations for files with the given extensions. This allows .htaccess files in subdirectories to undo any associations inherited from parent directories or the server config files. The extension argument is case insensitive and can be specified with or without a leading dot. The corresponding directives follow:
The AddHandler directive wakes up an existing handler and maps the filename(s) extension1, etc., to handler-name. You might specify the following in your Config file: AddHandler cgi-script cgi bzq From then on, any file with the extension .cgi or .bzq would be treated as an executable CGI script.
This does the same thing as AddHandler, but applies the transformation specified by handler-name to all files in the <Directory>, <Location>, or <Files> section in which it is placed, or in the directory governed by the .htaccess file. For instance, in Chapter 10, we write:

    <Location /status>
    <Limit GET>
    order deny,allow
    allow from 192.168.123.1
    deny from all
    </Limit>
    SetHandler server-status
    </Location>
The RemoveHandler directive removes any handler associations for files with the given extensions. This allows .htaccess files in subdirectories to undo any associations inherited from parent directories or the server config files. An example of its use might be:

    /foo/.htaccess:
        AddHandler server-parsed .html
    /foo/bar/.htaccess:
        RemoveHandler .html

This has the effect of returning .html files in the /foo/bar directory to being treated as normal files, rather than as candidates for parsing (see the mod_include module).

The extension argument is case insensitive and can be specified with or without a leading dot.
AcceptFilter controls a BSD-specific filter optimization. It is compiled in by default, and switched on by default if your system supports it (the setsockopt() option SO_ACCEPTFILTER). Currently, only FreeBSD supports this.
See http://httpd.apache.org/docs/misc/perf-bsd44.html for more information.
The compile-time flag AP_ACCEPTFILTER_OFF can be used to change the default to off. httpd -V and httpd -L will show compile-time defaults and whether or not SO_ACCEPTFILTER was defined during the compile.

6.2 Content Negotiation

There may be different ways to handle the data that Apache returns, and there are two equivalent ways of implementing this functionality. The multiviews method is simpler (and more limited) than the *.var method, so we shall start with it. The Config file (from ... /site.multiview) looks like this:

    User webuser
    Group webgroup
    ServerName www.butterthlies.com
    DocumentRoot /usr/www/APACHE3/site.multiview/htdocs
    ScriptAlias /cgi-bin /usr/www/APACHE3/cgi-bin
    AddLanguage it .it
    AddLanguage en .en
    AddLanguage ko .ko
    LanguagePriority it en ko
    <Directory /usr/www/APACHE3/site.multiview/htdocs>
    Options +MultiViews
    </Directory>

For historical reasons, you have to say:

    Options +MultiViews

even though you might reasonably think that Options All would cover the case. The general idea is that whenever you want to offer variations of a file (e.g., JPG, GIF, or bitmap for images, or different languages for text), multiviews will handle it. Apache v2 offers a relevant directive.

6.2.1 MultiviewsMatch

MultiviewsMatch permits three different behaviors for mod_negotiation's Multiviews feature.

    MultiviewsMatch [NegotiatedOnly] [Handlers] [Filters] [Any]
    server config, virtual host, directory, .htaccess
    Compatibility: only available in Apache 2.0.26 and later.

Multiviews allows a request for a file, e.g., index.html, to match any negotiated extensions following the base request, e.g., index.html.en, index.html.fr, or index.html.gz.

The NegotiatedOnly option provides that every extension following the base name must correlate to a recognized mod_mime extension for content negotiation, e.g., Charset, Content-Type, Language, or Encoding. This is the strictest implementation with the fewest unexpected side effects, and it's the default behavior.
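For illustration, the directive can sit next to the Options line in the <Directory> block of the multiviews Config file. This sketch simply spells out the default behavior explicitly, so it should change nothing on a server that already runs with the defaults.

```apache
<Directory /usr/www/APACHE3/site.multiview/htdocs>
    Options +MultiViews
    # State the default explicitly: only extensions that mod_mime
    # recognizes for content negotiation are allowed to match
    MultiviewsMatch NegotiatedOnly
</Directory>
```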
To include extensions associated with Handlers and/or Filters, set the MultiviewsMatch directive to either Handlers, Filters, or both option keywords. If all other factors are equal, the smallest file will be served, e.g., in deciding between index.html.cgi of 500 bytes and index.html.pl of 1,000 bytes, the .cgi file would win. Users of .asis files might prefer to use the Handlers option, if .asis files are associated with the asis-handler.

You may finally allow Any extensions to match, even if mod_mime doesn't recognize the extension. This was the behavior in Apache 1.3 and can cause unpredictable results, such as serving .old or .bak files that the webmaster never expected to be served.

6.2.2 Image Negotiation

Image negotiation is a special corner of general content negotiation because the Web has a variety of image files with different levels of support: for instance, some browsers can cope with PNG files and some can't, and the latter have to be sent the simpler, more old-fashioned, and bulkier GIF files. The client's browser sends a message to the server telling it which image files it accepts:

    HTTP_ACCEPT=image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*

Browsers almost always lie about the content types they accept or prefer, so this may not be all that reliable. In theory, however, the server uses this information to guide its search for an appropriate file, and then it returns it. We can demonstrate the effect by editing our ... /htdocs/catalog_summer.html file to remove the .jpg extensions on the image files. The appropriate lines now look like this:

    ...
    <img src="bench" alt="Picture of a Bench">
    ...
    <img src="hen" alt="Picture of a hencoop like a pagoda">
    ...

When Apache has the Multiviews option turned on and is asked for an image called bench, it looks for the smaller of bench.jpg and bench.gif (assuming the client's browser accepts both) and returns it.
Apache v2 introduces a new directive, which is related to the Filter mechanism (see later in this chapter, Section 6.6).

6.3 Language Negotiation

The same useful functionality also applies to language. To demonstrate this, we need to make up .html scripts in different languages. Well, we won't bother with actual different languages; we'll just edit the scripts to say, for example:

    <h1>Italian Version</h1>

and edit the English version so that it includes a new line:

    <h1>English Version</h1>

Then we give each file an appropriate extension:
Apache recognizes language variants: en-US is seen as a version of general English, en, which seems reasonable. You can also offer documents that serve more than one language. If you had a "franglais" version, you could serve it to both English speakers and Francophones by naming it frangdoc.en.fr.

Of course, in real life you would have to go to substantially more trouble, what with translators and special keyboards and all. Also, the Italian version of the index would need to point to Italian versions of the catalogs. But in the fantasy world of Butterthlies, Inc., it's all so simple. The Italian version of our index would be index.html.it.

By default, Apache looks for a file called index.html.<something>. If it has a language extension, like index.html.it, it will find the index file, happily add the language extension, and then serve up what the browser prefers. If, however, you call the index file index.it.html, Apache will still look for, and fail to find, index.html.<something>. If index.html.en is present, that will be served up. If index.en.html is there, then Apache gives up and serves up a list of all the files. The moral is, if you want to deal with index filenames in either order (index.it.html alongside index.html.en), you need the directive:

    DirectoryIndex index

to make Apache look for a file called index.<something> rather than the default index.html.<something>.

To give Apache the idea, we need the corresponding lines in the httpd1.conf file:

    AddLanguage it .it
    AddLanguage en .en
    AddLanguage ko .ko

Now our browser behaves in a rather civilized way. If you run ./go 1 on the server, go to the client machine, and go to Edit > Preferences > Languages (in Netscape 4) or Tools > Internet Options > Languages (MSIE) or wherever the language settings for your browser are kept, and set Italian to be first, you see the Italian version of the index. If you change to English and reload, you get the English version.

If you then go to catalog_summer, you see the pictures even though we didn't strictly specify the filenames. In a small way...magic!

Apache controls language selection if the browser doesn't. If you turn language preference off in your browser, edit the Config file (httpd2.conf) to insert the line:

    LanguagePriority it en ko

and then stop Apache and restart with ./go 2, the browser will get Italian.
The LanguagePriority directive sets the precedence of language variants for the case in which the client does not express a preference when handling a multiviews request. The MIME-lang list is in order of decreasing preference. For example:

    LanguagePriority en fr de

For a request for foo.html, where foo.html.fr and foo.html.de both exist but the browser did not express a language preference, foo.html.fr would be returned.

Note that this directive only has an effect if a "best" language cannot be determined by any other means. It will not work if there is a DefaultLanguage defined. Correctly implemented HTTP 1.1 requests will mean that this directive has no effect.

How does this all work? You can look ahead to the environment variables in Chapter 16. Among them are the following:

    ...
    HTTP_ACCEPT=image/gif,image/x-xbitmap,image/jpeg,image/pjpeg,*/*
    ...
    HTTP_ACCEPT_LANGUAGE=it
    ...

Apache uses this information to work out what it can acceptably send back from the choices at its disposal.
The AddLanguage directive maps the given filename extension to the specified content language. MIME-lang is the MIME language of filenames containing extensions. This mapping is added to any already in force, overriding any mappings that already exist for the same extension. For example:

    AddEncoding x-compress .Z
    AddLanguage en .en
    AddLanguage fr .fr

Then the document xxxx.en.Z will be treated as a compressed English document (as will the document xxxx.Z.en). Although the content language is reported to the client, the browser is unlikely to use this information. The AddLanguage directive is more useful for content negotiation, where the server returns one from several documents based on the client's language preference.

If multiple language assignments are made for the same extension, the last one encountered is the one that is used. That is, for the case of:

    AddLanguage en .en
    AddLanguage en-gb .en
    AddLanguage en-us .en

documents with the extension .en would be treated as being en-us.

The extension argument is case insensitive and can be specified with or without a leading dot.
The DefaultLanguage directive tells Apache that all files in the directive's scope (e.g., all files covered by the current <Directory> container) that don't have an explicit language extension (such as .fr or .de as configured by AddLanguage) should be considered to be in the specified MIME-lang language. This allows entire directories to be marked as containing Dutch content, for instance, without having to rename each file. Note that unlike using extensions to specify languages, DefaultLanguage can only specify a single language. If no DefaultLanguage directive is in force and a file does not have any language extensions as configured by AddLanguage, then that file will be considered to have no language attribute.
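A short sketch of the Dutch case mentioned above. The directory path is hypothetical, and nl is the language code assumed here for Dutch.

```apache
# Hypothetical directory whose files are all in Dutch
<Directory /usr/www/APACHE3/site.example/htdocs/nl>
    # Files without a language extension are treated as Dutch
    DefaultLanguage nl
</Directory>
```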
The RemoveLanguage directive removes any language associations for files with the given extensions. This allows .htaccess files in subdirectories to undo any associations inherited from parent directories or the server config files. The extension argument is case insensitive and can be specified with or without a leading dot.

6.4 Type Maps

In the last section, we looked at multiviews as a way of providing language and image negotiation. The other way to achieve the same effects in the current release of Apache, as well as more lavish effects later (probably to negotiate browser plug-ins), is to use type maps, also known as *.var files. Multiviews works by scrambling together a plain vanilla type map; now you have the chance to set it up just as you want it. The Config file in .../site.typemap/conf/httpd1.conf is as follows:

    User webuser
    Group webgroup
    ServerName www.butterthlies.com
    DocumentRoot /usr/www/APACHE3/site.typemap/htdocs
    AddHandler type-map var
    DirectoryIndex index.var

One should write, as seen in this file:

    AddHandler type-map var

Having set that, we can sensibly say:

    DirectoryIndex index.var

to set up a set of language-specific indexes. What this means, in plainer English, is that the DirectoryIndex line overrides the default index file index.html. If you also want index.html to be used as an alternative, you would have to specify it, but you probably don't, because you are trying to do something more elaborate here. In this case there are several versions of the index (index.en.html, index.it.html, and index.ko.html), so Apache looks for index.var for an explanation.

Look at ... /site.typemap/htdocs. We want to offer language-specific versions of the index.html file and alternatives to the generalized images bath, hen, tree, and bench, so we create two files, index.var and bench.var (we will only bother with one of the images, since the others are the same).

This is index.var:

    # It seems that this URI _must_ be the filename minus the extension...
    URI: index; vary="language"

    URI: index.en.html
    # Seems we _must_ have the Content-type or it doesn't work...
    Content-type: text/html
    Content-language: en

    URI: index.it.html
    Content-type: text/html
    Content-language: it

This is bench.var:

    URI: bench; vary="type"

    URI: bench.jpg
    Content-type: image/jpeg; qs=0.8 level=3

    URI: bench.gif
    Content-type: image/gif; qs=0.5 level=1

The first line tells Apache what file is in question, here index.* or bench.*; vary tells Apache what sort of variation we have. These are the possibilities:
The name of the corresponding header, as defined in the HTTP specification, is obtained by prefixing these names with Content-. These are the headers:
The qs numbers are quality scores, from 0 to 1. You decide what they are and write them in. The qs values for each type of return are multiplied to give the overall qs for each variant. For instance, if a variant has a qs of .5 for Content-type and a qs of .7 for Content-language, its overall qs is .35. The higher the result, the better. The level values are also numbers, and you decide what they are. In order for Apache to decide rationally which possibility to return, it resolves ties in the following way:
If you can predict the outcome of all this in your head, you must qualify for some pretty classy award! Following is the full list of possible directives, given in the Apache documentation:
To throw this into action, start Apache with ./go 1, set the language of your browser to Italian (in Netscape, choose Edit > Preferences > Netscape > Languages), and access http://www.butterthlies.com/. You should see the Italian version.

MSIE seems to provide less support for some languages, including Italian. You just get the English version.

When you look at catalog_summer.html, you see only the Bench image (and that labeled as "indirect") because we did not create var files for the other images.

6.5 Browsers and HTTP 1.1

Like any other human creation, the Web fills up with rubbish. The webmaster cannot assume that all clients will be using up-to-date browsers: all the old, useless versions are out there waiting to make a mess of your best-laid plans.

In 1996, the weekly Internet magazine devoted to Apache affairs, Apache Week (Issue 25), had this to say about the impact of the then-upcoming HTTP 1.1:
Although time has passed, the situation has probably not changed very much.

In addition, most browsers do not indicate a preference for particular types. This should be done by adding a preference factor (q) to each content type in the Accept header. For example, a browser that accepts Acrobat files might prefer them to HTML, so it could send an Accept list that includes:

    Accept: text/html; q=0.7, application/pdf; q=0.8

When the server handles the request, it combines this information with its source quality information (if any) to pick the "best" content type to return.

6.6 Filters

Apache v2 introduced a new mechanism called a "Filter", together with a reworking of Multiviews. The documentation says:
There is a demonstration filter that changes text to uppercase. In .../site.filter/htdocs we have two files, 1.txt and 1.html, which have the same contents:

    HULLO WORLD FROM site.filter

The Config file is as follows:

    User webuser
    Group webgroup
    Listen 80
    ServerName my586
    AddOutputFilter CaseFilter html
    DocumentRoot /usr/www/APACHE3/site.filter/htdocs

If we visit the site, we are offered a directory. If we choose 1.txt, we see the contents as shown earlier. If we choose 1.html, we find it has been through the filter and is now all uppercase:

    HULLO WORLD FROM SITE.FILTER

The Directives are as follows:
AddInputFilter maps the filename extensions extension to the filter or filters that will process client requests and POST input when they are received by the server. This is in addition to any filters defined elsewhere, including the SetInputFilter directive. This mapping is merged over any already in force, overriding any mappings that already exist for the same extension. If more than one filter is specified, they must be separated by semicolons in the order in which they should process the content. Both the filter and extension arguments are case insensitive, and the extension may be specified with or without a leading dot.
The AddOutputFilter directive maps the filename extensions extension to the filters that will process responses from the server before they are sent to the client. This is in addition to any filters defined elsewhere, including the SetOutputFilter directive. This mapping is merged over any already in force, overriding any mappings that already exist for the same extension. For example, the following configuration will process all .shtml files for server-side includes. AddOutputFilter INCLUDES shtml If more than one filter is specified, they must be separated by semicolons in the order in which they should process the content. Both the filter and extension arguments are case insensitive, and the extension may be specified with or without a leading dot.
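To illustrate the semicolon-separated form, the following sketch chains two filters. It assumes that both mod_include (which provides INCLUDES) and mod_deflate (which provides DEFLATE) are available in your build; mod_deflate is not discussed in this chapter and is used here only as a convenient second filter.

```apache
# Assumes mod_include and mod_deflate are both loaded.
# .shtml files are first processed for server-side includes,
# and the result is then compressed.
AddOutputFilter INCLUDES;DEFLATE shtml
```

The order matters: the includes have to be expanded before the output is compressed, so INCLUDES is listed first.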
The SetInputFilter directive sets the filter or filters that will process client requests and POST input when they are received by the server. This is in addition to any filters defined elsewhere, including the AddInputFilter directive. If more than one filter is specified, they must be separated by semicolons in the order in which they should process the content.
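As a sketch only, the following applies an input filter to every request under one URL. It assumes mod_deflate is loaded, whose DEFLATE filter can decompress request bodies; the /dav-area location is hypothetical.

```apache
# Assumes mod_deflate is loaded; /dav-area is a hypothetical location.
# Request bodies sent to this location are passed through DEFLATE
# before the server processes them.
<Location /dav-area>
    SetInputFilter DEFLATE
</Location>
```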
The SetOutputFilter directive sets the filters that will process responses from the server before they are sent to the client. This is in addition to any filters defined elsewhere, including the AddOutputFilter directive. For example, the following configuration will process all files in the /www/data/ directory for server-side includes: <Directory /www/data/> SetOutputFilter INCLUDES </Directory> If more than one filter is specified, they must be separated by semicolons in the order in which they should process the content.
The RemoveInputFilter directive removes any input filter associations for files with the given extensions. This allows .htaccess files in subdirectories to undo any associations inherited from parent directories or the server config files. The extension argument is case insensitive and can be specified with or without a leading dot.
The RemoveOutputFilter directive removes any output filter associations for files with the given extensions. This allows .htaccess files in subdirectories to undo any associations inherited from parent directories or the server config files. The extension argument is case insensitive and can be specified with or without a leading dot.
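A minimal sketch of the .htaccess use just described, assuming a parent directory or the server config has mapped shtml to an output filter (for instance with AddOutputFilter INCLUDES shtml). The /foo/bar path is hypothetical.

```apache
# In a hypothetical /foo/bar/.htaccess:
# stop .shtml files here from passing through any inherited
# output filter mapped to that extension
RemoveOutputFilter shtml
```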